[Day10] - Context: pl.DataFrame.select() and pl.DataFrame.with_columns() - iT 邦幫忙

2025 iThome Ironman Contest

DAY 10

Software Development

Polars熊霸天下 series, part 10

17th Ironman Contest python polars

Jerry Wu

2025-09-16 00:00:15

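Today we'll learn how to use pl.DataFrame.select() and pl.DataFrame.with_columns(), and cover the related expr concepts along the way.

Today's outline:

Imports and setup
pl.DataFrame.select()
pl.DataFrame.with_columns()
Quick ways to select columns
Naming columns with pl.Expr.alias() or keyword arguments
Naming multiple columns at once
Exprs within a context are evaluated in parallel
codepanda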

0. Imports and setup

import pandas as pd
import polars as pl

data = {"col1": [1, 2, 3], "col2": ["x", "y", "z"]}
df = pl.DataFrame(data)

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

1. `pl.DataFrame.select()`

pl.DataFrame.select() selects columns from the original dataframe. For example, to select the "col1" column:

df.select(pl.col("col1"))

shape: (3, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
└──────┘

Or to select both the "col1" and "col2" columns:

df.select(pl.col("col1"), pl.col("col2"))

# or
df.select(pl.col("col1", "col2"))

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

It can also produce a "col3" column:

df.select(pl.col("col1").alias("col3"))

shape: (3, 1)
┌──────┐
│ col3 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
└──────┘

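Note that the result contains only the "col3" column. This is because pl.DataFrame.select() builds its output solely from the exprs it is given. Placing pl.col("col1").alias("col3") inside pl.DataFrame.select() amounts to telling Polars: give me a column named "col3" whose values are the same as those of "col1".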

pl.DataFrame.select() returns columns in the order in which the exprs are given. The following example selects the "col3", "col1", and "col2" columns, in that order:

df.select(pl.col("col1").alias("col3"), pl.col("col1"), pl.col("col2"))

shape: (3, 3)
┌──────┬──────┬──────┐
│ col3 ┆ col1 ┆ col2 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ 1    ┆ x    │
│ 2    ┆ 2    ┆ y    │
│ 3    ┆ 3    ┆ z    │
└──────┴──────┴──────┘

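If some of the exprs inside pl.DataFrame.select() evaluate to a single value, Polars will broadcast that value. In the example below, pl.col("col2") is a column of shape (3, 1), whereas pl.col("col1").mean() is a single value, so pl.DataFrame.select() broadcasts that value into every row of its column, keeping the column's shape at (3, 1).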

df.select(pl.col("col1").mean(), pl.col("col2"))

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f64  ┆ str  │
╞══════╪══════╡
│ 2.0  ┆ x    │
│ 2.0  ┆ y    │
│ 2.0  ┆ z    │
└──────┴──────┘

However, when every expr evaluates to a single value, the result is returned as is, with a single row:

df.select(pl.col("col1").mean(), pl.col("col2").first())

shape: (1, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f64  ┆ str  │
╞══════╪══════╡
│ 2.0  ┆ x    │
└──────┴──────┘

From these observations, we can see that the dataframe returned by pl.DataFrame.select() may differ from the original in both its rows and its columns.

2. `pl.DataFrame.with_columns()`

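The biggest difference between pl.DataFrame.with_columns() and pl.DataFrame.select() is that pl.DataFrame.with_columns() keeps all columns of the original dataframe and appends the newly added column or columns at the end. For example: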

df.with_columns(pl.col("col1").add(1).alias("col3"))

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ x    ┆ 2    │
│ 2    ┆ y    ┆ 3    │
│ 3    ┆ z    ┆ 4    │
└──────┴──────┴──────┘

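In this example, we keep the original "col1" and "col2" columns and append a new "col3" column at the end, whose values are "col1" plus 1.

If a new column has the same name as an existing one, the new values are "pasted" into the original position. For example: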

df.with_columns(pl.col("col1").add(1))

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 2    ┆ x    │
│ 3    ┆ y    │
│ 4    ┆ z    │
└──────┴──────┘

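Here, the column produced by pl.col("col1").add(1) is named "col1"; its values are the old "col1" values plus 1, and the column's position is unchanged.

Note in particular that pl.DataFrame.with_columns() does not guarantee that the columns come out in the order in which the exprs are given. For example: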

df2 = pl.DataFrame({"col2": [1, 2, 3], "col1": ["x", "y", "z"]})
df2.with_columns(pl.col("col1"), pl.col("col2"))

shape: (3, 2)
┌──────┬──────┐
│ col2 ┆ col1 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

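As this example shows, pl.DataFrame.with_columns() did not arrange the columns in the order "col1", "col2". Instead, it followed the rule described earlier: when a new column shares its name with an existing one, the new values are pasted into the existing column and the column's position does not change.

To control the order of the columns, use pl.DataFrame.select() instead: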

df2.select(pl.col("col1"), pl.col("col2"))

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ x    ┆ 1    │
│ y    ┆ 2    │
│ z    ┆ 3    │
└──────┴──────┘

pl.DataFrame.with_columns() also supports broadcasting. For example, to add a "col3" column in which every row is 3, you can write:

df.with_columns(col3=3)

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ i32  │
╞══════╪══════╪══════╡
│ 1    ┆ x    ┆ 3    │
│ 2    ┆ y    ┆ 3    │
│ 3    ┆ z    ┆ 3    │
└──────┴──────┴──────┘

Finally, the following example shows that pl.DataFrame.with_columns() can only change the number of columns of a dataframe, never the number of rows.

df.with_columns(pl.col("col1").mean(), pl.col("col2").first())

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f64  ┆ str  │
╞══════╪══════╡
│ 2.0  ┆ x    │
│ 2.0  ┆ x    │
│ 2.0  ┆ x    │
└──────┴──────┘

Both pl.col("col1").mean() and pl.col("col2").first() are automatically broadcast to every row.

3. Quick ways to select columns

In fact, we can pass a column name, or a list of column names, directly to pl.DataFrame.select() or pl.DataFrame.with_columns() without writing an expr. For example:

df.select("col1", "col2")

# or
df.select(["col1", "col2"])

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

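However, because these column names are plain strings rather than exprs, no further operations can be applied to them, and all the magic that exprs provide is lost. Treat this as syntactic sugar for simply selecting columns.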

4. Naming columns with `pl.Expr.alias()` or keyword arguments

There are two ways to specify a column name:

df.select(pl.col("col1").alias("col3"))

# or
df.select(col3=pl.col("col1"))

shape: (3, 1)
┌──────┐
│ col3 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
└──────┘

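You can choose whichever style you prefer. Just remember that keyword names are subject to Python's own naming rules: for example, they cannot start with a digit or contain spaces.

For example, pl.Expr.alias() can name the "col1" column "1":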

df.select(pl.col("col1").alias("1"))

shape: (3, 1)
┌─────┐
│ 1   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

But a keyword argument cannot be used to name the "col1" column "1":

❌
# SyntaxError: expression cannot contain assignment, 
# perhaps you meant "=="?
df.select(1=pl.col("col1"))

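5. Naming multiple columns at once

When selecting several columns at once, pl.Expr.alias() cannot be used to name them, because it can only specify a single column name. In this case, consider the functions provided under pl.Expr.name, such as pl.Expr.name.prefix(), which quickly adds a prefix to multiple column names. For example, to prefix every column with "new_", you can write: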

df.select(pl.col("^col.*$").name.prefix("new_"))

shape: (3, 2)
┌──────────┬──────────┐
│ new_col1 ┆ new_col2 │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 1        ┆ x        │
│ 2        ┆ y        │
│ 3        ┆ z        │
└──────────┴──────────┘

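6. Exprs within a context are evaluated in parallel

Because the exprs within a context are evaluated in parallel, an expr cannot refer to a column that was newly named earlier in the same context. Suppose we want to add "col3" and "col4" columns: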

❌
# ColumnNotFoundError: col3
(
    df.with_columns(
        pl.col("col1").add(1).alias("col3"),
        pl.col("col3").mul(2).alias("col4"),
    )
)

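Here, pl.col("col1").add(1).alias("col3") is a valid expr that produces the "col3" column, but that column cannot be referenced right away within the same context.

Instead, we need two pl.DataFrame.with_columns() calls to add the "col3" and "col4" columns: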

(
    df.with_columns(
        pl.col("col1").add(1).alias("col3"),
    ).with_columns(pl.col("col3").mul(2).alias("col4"))
)

shape: (3, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ x    ┆ 2    ┆ 4    │
│ 2    ┆ y    ┆ 3    ┆ 6    │
│ 3    ┆ z    ┆ 4    ┆ 8    │
└──────┴──────┴──────┴──────┘

7. `codepanda`

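In Pandas, the counterpart of Polars' pl.DataFrame.select() and pl.DataFrame.with_columns() is pd.DataFrame.assign(). Because it evaluates its arguments in order, a later computation can refer to the result of an earlier one. For example, to add "col3" and "col4" columns: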

df_pd = pd.DataFrame(data)

(
    df_pd.assign(
        col3=lambda df_: df_.col1.add(1), col4=lambda df_: df_.col3.mul(2)
    )
)

   col1 col2  col3  col4
0     1    x     2     4
1     2    y     3     6
2     3    z     4     8

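Note that in pd.DataFrame.assign(), the later "col4" column refers to the "col3" column that was "just" created. This ability to reference information produced earlier in the chain is a key feature of pd.DataFrame.assign().

Finally, using pd.DataFrame.assign() helps you steer clear of the notorious SettingWithCopyWarning.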
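As a quick check, the sketch below shows that pd.DataFrame.assign() returns a new dataframe and leaves the original untouched. The names df_ and result are only illustrative.

```python
import pandas as pd

df_pd = pd.DataFrame({"col1": [1, 2, 3], "col2": ["x", "y", "z"]})

result = df_pd.assign(
    col3=lambda df_: df_["col1"] + 1,
    # "col3" already exists by the time this lambda is evaluated.
    col4=lambda df_: df_["col3"] * 2,
)

print(list(df_pd.columns))      # ['col1', 'col2']
print(list(result.columns))     # ['col1', 'col2', 'col3', 'col4']
print(result["col4"].tolist())  # [4, 6, 8]
```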

Code

Link to today's code.
